Overfitting occurs when a model gives too much importance to the individual training points, effectively fitting a curve that passes through every one of them, including noise and outliers, and as a result performs poorly on new data. To counter this, we reduce the weight given to these exact points by adding a penalty term to the loss function, discouraging the model from assigning too much importance to individual features or coefficients. This technique is called regularization: by penalizing complexity, it helps simpler models generalize better to unseen data.
There are three primary types of regularization: Lasso (L1), Ridge (L2), and Elastic Net (combined L1-L2). The only difference among them is the form of the penalty term that is added.
There are several ways of controlling the capacity of Machine Learning models and Neural Networks to prevent overfitting:
1. L1 regularization

L1 regularization is a relatively common form of regularization, where for each weight w we add the term λ|w| to the objective. L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e., very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, with many small values. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.
A regression model that uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression. It adds the absolute value of the magnitude of each coefficient as a penalty term to the loss function (L). This penalty can shrink some coefficients to exactly zero, which helps in selecting only the important features and ignoring the less important ones.
Cost = (1/n) ∑(i=1 to n) (yᵢ − ŷᵢ)² + λ ∑(i=1 to m) |wᵢ|
Where:
n - Number of Examples
m - Number of Features
yᵢ - Actual Target Value
ŷᵢ - Predicted Target Value
wᵢ - Coefficients of the Features
λ - Regularization parameter that controls the strength of regularization
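The sparsity-inducing behavior described above can be demonstrated with a minimal sketch of Lasso fit by proximal gradient descent (ISTA). The toy data, λ, step size, and iteration count below are all illustrative choices, not values prescribed by the text:

```python
import numpy as np

# Toy regression problem where only 2 of 5 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrinks values toward exact zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam=0.1, lr=0.2, n_iter=500):
    # lr is chosen below 1/L for this toy problem (L = Lipschitz constant
    # of the MSE gradient).
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)         # gradient of the MSE term
        w = soft_threshold(w - lr * grad, lr * lam)  # L1 penalty as a prox step
    return w

w = lasso_ista(X, y)
# Coefficients of the uninformative features are driven to exactly zero,
# while the two informative ones stay large (slightly shrunk).
```

Printing `w` shows the hallmark of L1: the three irrelevant coefficients land at exactly 0.0, which is the "feature selection" effect the text describes.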
2. L2 regularization

L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight w in the network, we add the term ½λw² to the objective, where λ is the regularization strength. It is common to see the factor of ½ in front because then the gradient of this term with respect to the parameter w is simply λw instead of 2λw. L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during the gradient descent parameter update, using L2 regularization ultimately means that every weight is decayed linearly towards zero: w += -lambda * w.
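The linear-decay interpretation can be checked directly. In this small sketch (lr, lam, and the weights are illustrative values) the data gradient is set to zero so only the penalty term acts, and each update multiplies the weights by a constant factor:

```python
import numpy as np

# The L2 penalty 0.5 * lam * w**2 contributes lam * w to the gradient,
# so every gradient step shrinks w linearly toward zero.
lr, lam = 0.1, 0.01
w = np.array([2.0, -3.0, 0.5])
data_grad = np.zeros_like(w)            # pretend the data loss is flat here

for _ in range(100):
    w -= lr * (data_grad + lam * w)     # equivalently: w += -lr * lam * w

# With no data gradient, each step multiplies w by (1 - lr * lam),
# so after 100 steps w equals the initial weights times (1 - 0.001)**100.
```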
A regression model that uses the L2 regularization technique is called Ridge regression. It adds the squared magnitude of the coefficient as a penalty term to the loss function (L).
Cost = (1/n) ∑(i=1 to n) (yᵢ − ŷᵢ)² + λ ∑(i=1 to m) (wᵢ²)
Where:
n = Number of examples or data points
m = Number of features i.e., predictor variables
yᵢ = Actual target value for the iᵗʰ example
ŷᵢ = Predicted target value for the iᵗʰ example
wᵢ = Coefficients of the features
λ = Regularization parameter that controls the strength of regularization
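Unlike Lasso, the Ridge cost above has a closed-form minimizer: setting the gradient to zero gives w = (XᵀX + nλI)⁻¹Xᵀy. A minimal sketch with illustrative toy data:

```python
import numpy as np

# Ridge regression: minimizing (1/n)*||y - Xw||^2 + lam*||w||^2
# has the closed-form solution w = (X^T X + n*lam*I)^(-1) X^T y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

def ridge_fit(X, y, lam):
    n, m = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(m), X.T @ y)

w_small = ridge_fit(X, y, lam=0.01)    # light shrinkage: close to true_w
w_large = ridge_fit(X, y, lam=100.0)   # heavy shrinkage: pushed toward zero
```

Comparing the two fits shows the behavior described earlier: a large λ shrinks all coefficients toward zero, but none of them reach exactly zero.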
[Animation: regularization strength λ increasing from 0 to 10, showing how the penalty surface becomes increasingly extreme.]
3. Elastic Net regularization

It is possible to combine L1 regularization with L2 regularization: λ₁|w| + λ₂w² (this is called Elastic Net regularization).
Elastic Net regression is a combination of both L1 and L2 regularization: the penalty includes both the absolute values and the squares of the weights, with an extra hyperparameter that controls the ratio between the L1 and L2 terms:
Cost = (1/n) ∑(i=1 to n) (yᵢ − ŷᵢ)² + λ [ α ∑(i=1 to m) |wᵢ| + (1 − α) ∑(i=1 to m) (wᵢ²) ]
Where:
n = Number of examples (data points)
m = Number of features (predictor variables)
yᵢ = Actual target value for the iᵗʰ example
ŷᵢ = Predicted target value for the iᵗʰ example
wᵢ = Coefficients of the features
λ = Regularization parameter that controls the strength of regularization
α = Mixing parameter where 0 ≤ α ≤ 1
α = 1 → Lasso (pure L1)
α = 0 → Ridge (pure L2)
0 < α < 1 → Balance of L1 and L2
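The mixed penalty can be optimized with the same proximal gradient idea used for Lasso: the smooth squared (L2) part is folded into the gradient, and the L1 part is handled by soft-thresholding. A minimal sketch, with illustrative data and hyperparameter values:

```python
import numpy as np

# Elastic Net via proximal gradient descent.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=100)

def elastic_net(X, y, lam=0.1, alpha=0.5, lr=0.2, n_iter=500):
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        # gradient of the smooth part: (1/n)||y - Xw||^2 + lam*(1-alpha)*||w||^2
        grad = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * (1 - alpha) * w
        z = w - lr * grad
        # soft-thresholding handles the lam * alpha * sum(|w_i|) part
        w = np.sign(z) * np.maximum(np.abs(z) - lr * lam * alpha, 0.0)
    return w

w = elastic_net(X, y)   # alpha = 0.5 mixes sparsity (L1) with shrinkage (L2)
# The uninformative coefficients end up at or very near zero, while the
# informative ones are shrunk somewhat more than plain Lasso would shrink them.
```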
[Figure: Ridge regression (solid lines) has an L2 penalty and shrinks coefficients towards 0, but very rarely all the way to 0. Lasso regression (dashed lines) has an L1 penalty and shrinks coefficients all the way to 0, resulting in sparser solutions.]
4. Max norm constraints
Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector w of every neuron to satisfy ‖w‖₂ < c. Typical values of c are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot “explode” even when the learning rates are set too high because the updates are always bounded.
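The projection step is easy to sketch in NumPy. Here each column of W holds one neuron's incoming weights, and c = 3.0 follows the "order of 3 or 4" guideline above; the matrix values are illustrative:

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Rescale each column (one neuron's weight vector) to norm at most c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # 1.0 if already inside the ball
    return W * scale

W = np.array([[6.0, 0.3],
              [8.0, 0.4]])      # first neuron has norm 10, second has norm 0.5
W = max_norm_project(W)         # first column rescaled to norm 3, second untouched
```

In projected gradient descent this function would be called right after every ordinary parameter update, which is why the weights can never "explode" regardless of the learning rate.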
5. Dropout
Dropout is an extremely effective and simple regularization technique, introduced by Srivastava et al. (2014), that complements the other methods (L1, L2, max norm). While training, dropout is implemented by keeping a neuron active only with some probability p (a hyperparameter), and setting it to zero otherwise.
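A common way to implement this is "inverted" dropout, sketched below on one layer's activations with illustrative shapes and p = 0.5. Dividing the mask by p keeps the expected activation unchanged, so nothing needs to be rescaled at test time:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.5                                    # keep probability (hyperparameter)
h = rng.normal(size=(4, 10))               # activations of a hidden layer

mask = (rng.random(h.shape) < p) / p       # entries are 0 (dropped) or 1/p (kept)
h_train = h * mask                         # training-time forward pass
h_test = h                                 # test time: use activations as-is
```

Because kept units are scaled by 1/p during training, the test-time forward pass is just the ordinary one, with no extra correction factor.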